Eigenvectors for clustering: 
Unipartite, bipartite, and directed graph cases 
Abstract: This paper presents a concise tutorial on
spectral clustering for a broad spectrum of graphs,
including unipartite (undirected) graphs, bipartite graphs,
and directed graphs. We show how to transform a bipartite
graph and a directed graph into a corresponding unipartite
graph, thereby allowing a unified treatment of all cases.
For bipartite graphs, we show that the relaxed solution to
K-way co-clustering can be found by computing the
left and right eigenvectors of the data matrix. This gives
a theoretical basis for the K-way spectral co-clustering
algorithms proposed in the literature. We also show that
solving row and column co-clustering is equivalent to
solving row and column clustering separately, thus giving
theoretical support for the claim: "column clustering
implies row clustering and vice versa". In the
last part, we generalize the Ky Fan theorem, which is
the central theorem for explaining spectral clustering,
to rectangular complex matrices, motivated by the results
from the bipartite graph analysis.

Keywords: eigenvectors, graph clustering, Ky Fan the- 
orem, spectral methods. 



1 Introduction 

Many papers have been written to reveal the secret and
power of spectral clustering: the use of eigenvectors of
an affinity matrix induced from a graph to find a natural
grouping of the vertices. Some noteworthy works are
[1, 2, 3, 4, 5], and a comprehensive tutorial can be found
in [6]. Despite being intensively studied, it is quite hard
to find an intuitive and concise explanation of how and
why spectral clustering works. So, the logic behind
spectral clustering will be explained first.

Most works deal with bipartite data clustering since
many real datasets such as collections of documents,
movie ratings, and experimental samples are bipartite.
The usual approach for this case is to transform the
feature-by-item rectangular matrix induced from a
bipartite dataset into a corresponding symmetric matrix
by using a kernel function. Then, a treatment similar to
that of the unipartite graph can be applied to this symmetric
matrix to find the clusters.

However, simultaneous row and column clustering
(co-clustering) works on the original data matrix, hence the
above approach will not work. In subsection 4.2 we show
that the co-clustering problem can be restated as the
clustering of a bipartite graph with two types of vertices
(item vertices and feature vertices) where the induced
affinity matrix is symmetric. Thus, various clustering
algorithms built for unipartite graphs can be employed
directly.

In a directed graph, edge directions are usually ignored
to get an equivalent unipartite graph representation.
However, as noted in [7], ignoring the edge directions
can lead to a poor result, and a significant improvement
can be achieved by accounting for the edge directions in
the model. As rows and columns of the induced affinity
matrix of a directed graph correspond to the same set
of vertices in the same order, as far as the clustering
problem is concerned, a symmetric matrix can be
formed by simply adding the matrix to its transpose,
therefore allowing a treatment similar to that of the
unipartite graph. We will discuss this in more detail in
subsection 4.3.

A note on notation: $\mathbb{C}^{N \times K}$ denotes an $N \times K$ complex
matrix, $\mathbb{R}^{N \times K}$ denotes an $N \times K$ real matrix,
$\mathbb{R}_+^{N \times K}$ denotes an $N \times K$ nonnegative real matrix,
$\mathbb{B}_+^{N \times K}$ denotes an $N \times K$ binary matrix,
$k \in [1, K]$ denotes $k = 1, \ldots, K$,
and whenever a complex matrix is concerned, the transpose
operation refers to the conjugate transpose.

2 The Ky Fan theorem 

The Ky Fan theorem [5] relates eigenvectors of a Her- 
mitian matrix to the trace maximization problem of the 
matrix. 

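Theorem 1. The optimal value of the following problem:

$$\max_{\mathbf{X}^T \mathbf{X} = \mathbf{I}_K} \operatorname{tr}\left(\mathbf{X}^T \mathbf{H} \mathbf{X}\right) \tag{1}$$

is equal to $\sum_{k=1}^{K} \lambda_k$ if

$$\mathbf{X} = [\mathbf{u}_1, \ldots, \mathbf{u}_K]\,\mathbf{Q}, \tag{2}$$

where $\mathbf{H} \in \mathbb{C}^{N \times N}$ denotes a full rank Hermitian matrix
with eigenvalues $\lambda_1 \geq \ldots \geq \lambda_N \in \mathbb{R}_+$, $1 \leq K \leq N$,
$\mathbf{X} \in \mathbb{C}^{N \times K}$ denotes a unitary matrix, $\mathbf{I}_K$ denotes a $K \times K$
identity matrix, $\mathbf{u}_k \in \mathbb{C}^N$ denotes the $k$-th eigenvector
corresponding to $\lambda_k$, and $\mathbf{Q} \in \mathbb{C}^{K \times K}$ denotes an arbitrary
unitary matrix.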

The solution to eq. (1) is not unique since X remains
equally good under arbitrary rotation and reflection due
to the existence of the unitary matrix Q. However, since
[u_1, ..., u_K] is one of the optimal solutions, setting X =
[u_1, ..., u_K] already attains the optimal value.
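
The statement of Theorem 1 can be checked numerically. The following is a minimal NumPy sketch, not part of the original derivation: it assumes a randomly generated real symmetric positive definite matrix as a special case of the Hermitian H, and the names `U_K`, `ky_fan_value`, and `X_rand` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 6, 2

# Assumption: a random real symmetric positive definite H (a special Hermitian case).
B = rng.standard_normal((N, N))
H = B @ B.T + N * np.eye(N)

# eigh returns eigenvalues in ascending order; reverse to get lambda_1 >= ... >= lambda_N.
eigvals, eigvecs = np.linalg.eigh(H)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

U_K = eigvecs[:, :K]                 # [u_1, ..., u_K]
ky_fan_value = eigvals[:K].sum()     # sum of the K largest eigenvalues

# Any orthogonal Q gives an equally good solution X = U_K Q.
Q, _ = np.linalg.qr(rng.standard_normal((K, K)))
X = U_K @ Q
trace_rotated = np.trace(X.T @ H @ X)

# A random orthonormal X should not exceed the eigenvector solution.
X_rand, _ = np.linalg.qr(rng.standard_normal((N, K)))
trace_random = np.trace(X_rand.T @ H @ X_rand)

print(ky_fan_value, trace_rotated, trace_random)
```

The rotated solution reproduces the sum of the K largest eigenvalues, which illustrates why the choice of Q does not matter for the optimal value.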



If $\mathbf{H} = \mathbf{W} \in \mathbb{R}_+^{N \times N}$, where $\mathbf{W}$ denotes a symmetric
affinity matrix induced from a graph, and $\mathbf{X}$ is constrained
to be nonnegative while preserving the orthogonality,
i.e., $\mathbf{X}^T \mathbf{X} = \mathbf{I}_K$, then the problem in eq. (1) turns into the
K-way graph cuts problem. Therefore, the Ky Fan theorem
can be viewed as a relaxed version of the graph cuts.
This relationship explains the logic behind spectral
clustering, where an orthogonal nonnegative clustering
indicator matrix is approximated by computing the first K
eigenvectors of $\mathbf{W}$.

Eigenvectors of a matrix can be computed by using 
singular value decomposition (SVD). Hence, SVD will 
be discussed in the following section as many algorithms 
and software for computing SVD are available for use. 

3 Singular value decomposition 

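SVD is a matrix decomposition technique that factorizes
a matrix into a combination of left singular vectors, right
singular vectors, and singular values (referred to in this
paper as the left eigenvectors, right eigenvectors, and
eigenvalues of the matrix). The SVD of a full rank matrix
$\mathbf{A} \in \mathbb{C}^{M \times N}$ is defined as:

$$\mathbf{A} = \sum_{k=1}^{\min(M,N)} \sigma_k \mathbf{u}_k \mathbf{v}_k^T. \tag{3}$$

Or, in a more compact form:

$$\mathbf{A} = \mathbf{U} \mathbf{S} \mathbf{V}^T, \tag{4}$$

where $\mathbf{U} \in \mathbb{C}^{M \times M} = [\mathbf{u}_1, \ldots, \mathbf{u}_M]$ denotes a unitary
(orthogonal in the real case) matrix containing the left singular vectors of $\mathbf{A}$,
$\mathbf{V} \in \mathbb{C}^{N \times N} = [\mathbf{v}_1, \ldots, \mathbf{v}_N]$ denotes a unitary matrix
containing the right singular vectors of $\mathbf{A}$, and $\mathbf{S} \in \mathbb{R}_+^{M \times N}$
denotes a diagonal matrix containing the singular values
$\sigma_1 \geq \ldots \geq \sigma_{\min(M,N)}$ of $\mathbf{A}$ along its diagonal.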



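In practice, a rank-$K$ approximation of $\mathbf{A}$ is usually
used instead:

$$\mathbf{A}_K = \mathbf{U}_K \mathbf{S}_K \mathbf{V}_K^T, \tag{5}$$

where usually $K \ll \min(M, N)$, $\mathbf{U}_K$ and $\mathbf{V}_K$ contain
the first $K$ columns of $\mathbf{U}$ and $\mathbf{V}$ respectively, and $\mathbf{S}_K$
denotes the $K \times K$ leading principal submatrix of $\mathbf{S}$. According
to the Eckart-Young theorem, $\mathbf{A}_K$ is the closest rank-$K$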
approximation of A [S] . 
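
As an illustration of eq. (5), the sketch below builds a rank-K approximation with NumPy and compares its Frobenius error with the value predicted by the Eckart-Young theorem (the square root of the sum of the squared discarded singular values). The random test matrix and the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, K = 8, 5, 2
A = rng.standard_normal((M, N))   # assumed test matrix

# Thin SVD: A = U @ diag(s) @ Vh, singular values sorted in descending order.
U, s, Vh = np.linalg.svd(A, full_matrices=False)

# Rank-K approximation from the first K singular triplets, as in eq. (5).
A_K = U[:, :K] @ np.diag(s[:K]) @ Vh[:K, :]

# Frobenius error of the truncation versus the Eckart-Young prediction.
err = np.linalg.norm(A - A_K, "fro")
expected_err = np.sqrt(np.sum(s[K:] ** 2))
print(err, expected_err)
```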

In the following section we show how to modify graph 
clustering objectives into trace maximization of corre- 
sponding symmetric matrices. And by relaxing the non- 
negativity constraints, according to the Ky Fan theorem, 
clustering problems eventually become the tasks of find- 
ing the first K eigenvectors of the matrices, which are 
exactly the SVD problems. 

4 Graph clustering 

Graphs usually can be represented by symmetric, recta- 
ngular, or square affinity matrices. A collection of items 
connected by weighted edges describing similarities bet- 
ween item pairs like a friendship network can be modeled 
by a unipartite graph, then a symmetric affinity matrix 
can be induced from this graph. A collection of doc- 
uments (and in general any bipartite dataset) can be 
modeled by a bipartite graph, and a term-by-document 
rectangular matrix containing (adjusted) frequencies of 
those terms in the documents can be constructed. And a 
square affinity matrix can be induced from a (unipartite) 
directed graph like WWW network. 

Let $\mathcal{G}(\mathbf{A}) \equiv \mathcal{G}(\mathcal{V}, \mathcal{E}, \mathbf{A})$ be the graph representation
of a collection, where $\mathcal{V}$ denotes the set of vertices, $\mathcal{E}$
denotes the set of edges connecting vertex pairs, and $\mathbf{A}$
denotes the induced affinity matrix. K-way graph
clustering is the problem of finding the best cuts on
$\mathcal{G}(\mathbf{A})$ that maximize within-cluster association or,
equivalently, minimize inter-cluster cuts to produce K clusters
of $\mathcal{V}$.

Here we state two assumptions to allow the graph cuts 
be employed in clustering. 



Assumption 1. Let $e_{ij}$ be an edge connecting vertex $v_i$
to $v_j$. The weight value $|e_{ij}|$ denotes the similarity
between $v_i$ and $v_j$ linearly, i.e., if $|e_{ij}| = n|e_{ik}|$ then $v_i$ is
$n$ times more similar to $v_j$ than to $v_k$. Zero weight
means no similarity.

Note that similarity term has many interpretations 
depending on the domain. For example, in the city 
road network the similarity can refer to the distance; the 
closer the distance between two points, the more similar 
those points are. And in the movie ratings, the similar- 
ity can refer to the number of common movies rated by 
the users. 



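Assumption 2. Graph clustering refers to hard clustering,
i.e., for $\{\mathcal{V}_k\}_{k=1}^{K} \subset \mathcal{V}$, $\bigcup_{k=1}^{K} \mathcal{V}_k = \mathcal{V}$, and
$\mathcal{V}_k \cap \mathcal{V}_l = \emptyset \;\; \forall k \neq l$.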

Proposition 1. Assumptions 1 and 2 lead to the grouping
of similar vertices in $\mathcal{G}(\mathbf{A})$.

Proof. Consider $\mathcal{G}(\mathbf{A})$ to be clustered into K groups by
initial random assignments. Since assumption 1 guarantees
$|e_{ij}|$ to be comparable, and assumption 2 guarantees
each vertex to be assigned only to a single cluster, the
cluster assignment for $v_i$ ($z_{ik}$) can be found by finding the
cluster's center that is most similar to $v_i$:

$$z_{ik} = \arg\max_{k} \left\{ \frac{\sum_{v_j \in \mathcal{V}_k} |e_{ij}|}{|\mathcal{V}_k|} \;\middle|\; k \in [1, K] \right\}, \tag{6}$$

where $|\mathcal{V}_k|$ denotes the size of cluster $k$. The objective in
eq. (6) is the K-means clustering applied to $\mathcal{G}(\mathbf{A})$, and
therefore leads to the grouping of similar vertices. □

Note that assumption 1 is an ideal situation which
generally doesn't hold. For example, in the bipartite
representation of a term-by-document matrix, the
relationships between term-document pairs are usually not linear
in the corresponding term frequencies. Therefore,
preprocessing steps (e.g., feature selection and term
weighting) are usually necessary before applying the graph
cuts. The preprocessing steps seem to be very crucial
for obtaining good results [10], and many works are
devoted to finding more accurate similarity measure schemes
[11, 12, 13].

Even when the similarities are reflected by
the weights in an (almost) linear fashion, a normalization
scheme on A is generally preferable in order to produce
clusters of balanced size. In fact, normalized association/cuts
objectives are reported to offer better results compared to
their unnormalized counterparts, ratio association/cuts
objectives dig [14]. 

Table 1 shows the most popular graph clustering
objectives, the first two of which are from the work
of Dhillon et al. [14]. GWAssoc (GWCuts) refers to
general weighted association (cuts), NAssoc (NCuts) refers
to normalized association (cuts), and RAssoc (RCuts)
refers to ratio association (cuts). Since all other
objectives can be derived from GWAssoc [14], we will only
consider GWAssoc for the rest of this paper.

4.1 Unipartite graph clustering 

The unipartite graph is the framework for deriving a unified
treatment for the three graphs, so we discuss it first. The
following proposition summarizes the effort of Dhillon et
al. [14] in providing a general unipartite graph clustering
objective.



Proposition 2. Unipartite graph clustering can be 
stated in the trace maximization problem of a symmetric 
matrix. 

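Proof. Let $\mathbf{W} \in \mathbb{R}_+^{N \times N}$ be the symmetric affinity matrix
induced from a unipartite graph. K-way partitioning on
$\mathcal{G}(\mathbf{W})$ using GWAssoc can be found by:

$$\max J_u = \frac{1}{K} \sum_{k=1}^{K} \frac{\mathbf{z}_k^T \mathbf{W} \mathbf{z}_k}{\mathbf{z}_k^T \boldsymbol{\Phi} \mathbf{z}_k}, \tag{7}$$

where $\boldsymbol{\Phi} \in \mathbb{R}_+^{N \times N}$ denotes a diagonal matrix with $\Phi_{ii}$
associated with the weight of $v_i$, and $\mathbf{z}_k \in \mathbb{B}_+^{N}$ denotes a
binary indicator vector for cluster $k$ whose $i$-th entry
is 1 if $v_i$ is in cluster $k$, and 0 otherwise.

The objective above can be rewritten more compactly
as a trace maximization:

$$\max J_u = \frac{1}{K} \operatorname{tr}\left( \frac{\mathbf{Z}^T \mathbf{W} \mathbf{Z}}{\mathbf{Z}^T \boldsymbol{\Phi} \mathbf{Z}} \right) = \frac{1}{K} \operatorname{tr}\left( \bar{\mathbf{Z}}^T \boldsymbol{\Phi}^{-1/2} \mathbf{W} \boldsymbol{\Phi}^{-1/2} \bar{\mathbf{Z}} \right), \tag{8}$$

where $\mathbf{Z} \in \mathbb{B}_+^{N \times K} = [\mathbf{z}_1, \ldots, \mathbf{z}_K]$ denotes the clustering
indicator matrix, and $\bar{\mathbf{Z}} \in \mathbb{R}_+^{N \times K} = \mathbf{Z} / (\mathbf{Z}^T \mathbf{Z})^{1/2}$
denotes its orthonormal version. □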



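By relaxing the strict nonnegativity constraints, i.e.,
allowing $\bar{\mathbf{Z}}$ to contain negative values while preserving
its orthonormality, according to the Ky Fan theorem,
the global optimum of $J_u$ can be obtained by assigning

$$\hat{\mathbf{Z}} = [\mathbf{u}_1, \ldots, \mathbf{u}_K]\,\mathbf{Q}, \tag{9}$$

where $\mathbf{u}_1, \ldots, \mathbf{u}_K \in \mathbb{C}^{N}$ denote the first K eigenvectors
of $\boldsymbol{\Phi}^{-1/2} \mathbf{W} \boldsymbol{\Phi}^{-1/2}$, $\hat{\mathbf{Z}} \in \mathbb{C}^{N \times K}$ denotes a relaxed version
of $\bar{\mathbf{Z}}$, and $\mathbf{Q} \in \mathbb{R}^{K \times K}$ denotes an arbitrary orthonormal
matrix. Hence, eq. (9) presents a tractable solution for the
NP-hard problem in eq. (8).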

The GWAssoc objective in eq. (8) can be replaced
by any objective in table 1 by substituting $\mathbf{W}$ and $\boldsymbol{\Phi}$
with the corresponding affinity and weight matrices. Note
that $\mathbf{I}$ denotes the identity matrix, $\mathbf{D} \in \mathbb{R}_+^{N \times N}$ denotes
a diagonal matrix with its diagonal entries defined as
$D_{ii} = \sum_j W_{ij}$, and $\mathbf{L} = \mathbf{D} - \mathbf{W}$ denotes the Laplacian
of $\mathcal{G}(\mathbf{W})$.
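
To make the relaxation in eq. (9) concrete, here is a minimal sketch for the NAssoc case (weight matrix D from table 1) on a toy graph. The helper name `relaxed_indicator`, the toy graph, and the sign-based postprocessing for K = 2 are assumptions made for illustration; the paper itself does not prescribe a postprocessing step at this point.

```python
import numpy as np

def relaxed_indicator(W, phi, K):
    """Top-K eigenpairs of Phi^{-1/2} W Phi^{-1/2}; phi holds the diagonal of Phi."""
    inv_sqrt = 1.0 / np.sqrt(phi)
    H = inv_sqrt[:, None] * W * inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(H)            # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:K]        # indices of the K largest
    return vals[order], vecs[:, order]

# Toy graph: two triangles {0,1,2} and {3,4,5} joined by one weak edge (2,3).
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1

D = W.sum(axis=1)                             # NAssoc: Phi = D
vals, Z_hat = relaxed_indicator(W, D, K=2)

# For K = 2, the sign pattern of the second eigenvector separates the two triangles.
labels = (Z_hat[:, 1] > 0).astype(int)
print(vals, labels)
```

Swapping `D` for a vector of ones would correspond to the RAssoc row of table 1 under the same sketch.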

4.2 Bipartite graph clustering 

Bipartite graph clustering generally refers to the clustering
of bipartite datasets, i.e., collections of items that are
characterized by some shared features. A feature-by-item
rectangular data matrix $\mathbf{A} \in \mathbb{R}_+^{M \times N}$ contains
entries that describe the relationships between items and
features.



Table 1: Graph clustering objectives.

Objective    Affinity matrix    Weight matrix
GWAssoc      W                  Φ
GWCuts       Φ - L              Φ
NAssoc       W                  D
NCuts        D - L              D
RAssoc       W                  I
RCuts        I - L              I



Bipartite graph clustering can be done in two different
ways: directly and indirectly. The direct method
applies the graph cuts directly to $\mathcal{G}(\mathbf{A})$, resulting in
partitions that contain both item and feature vertices.
The indirect method first transforms $\mathcal{G}(\mathbf{A})$ into an
equivalent unipartite graph (either an item or a feature graph) by
calculating similarities between vertex pairs from either the
item or the feature set, and then applies the graph cuts to
this unipartite graph. Both methods lead to symmetric
affinity matrices, thus the GWAssoc objective can be
applied in the same way as in the unipartite graph case.

4.2.1 Direct treatment 

If GWAssoc is applied to a bipartite graph, similar items
will be grouped together with relevant features. This is
known as simultaneous feature and item clustering or
co-clustering.

Proposition 3. Bipartite graph co-clustering can be 
stated in the trace maximization problem of a symmetric 
matrix. 



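Proof. Let $\mathbf{M} \in \mathbb{R}_+^{P \times P}$ ($P = M + N$) be the symmetric
affinity matrix induced from a bipartite graph. $\mathbf{M}$ is
defined as:

$$\mathbf{M} = \begin{bmatrix} \mathbf{0} & \mathbf{A} \\ \mathbf{A}^T & \mathbf{0} \end{bmatrix}. \tag{10}$$

Taking GWAssoc as the objective, K-way co-clustering
can be found by:

$$\max J_b = \frac{1}{K} \sum_{k=1}^{K} \frac{\mathbf{z}_k^T \mathbf{M} \mathbf{z}_k}{\mathbf{z}_k^T \boldsymbol{\Phi} \mathbf{z}_k}. \tag{11}$$

Then, eq. (11) can be rewritten as:

$$\max J_b = \frac{1}{K} \operatorname{tr}\left( \bar{\mathbf{Z}}^T \boldsymbol{\Phi}^{-1/2} \mathbf{M} \boldsymbol{\Phi}^{-1/2} \bar{\mathbf{Z}} \right), \tag{12}$$

where $\boldsymbol{\Phi} \in \mathbb{R}_+^{P \times P}$, $\mathbf{z}_k \in \mathbb{B}_+^{P}$, and $\bar{\mathbf{Z}} \in \mathbb{R}_+^{P \times K}$ are defined
equivalently as in the unipartite graph case. □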



By relaxing the nonnegativity constraints on $\bar{\mathbf{Z}}$, the
optimum value of eq. (12) can be found by computing the
first K eigenvectors of $\boldsymbol{\Phi}^{-1/2} \mathbf{M} \boldsymbol{\Phi}^{-1/2}$.



Instead of constructing M which is bigger and sparser 
than the original matrix A, we provide a way to co- 
cluster bipartite graph directly from A. 

Theorem 2. A relaxed solution to the bipartite graph
co-clustering problem in eq. (12) can be found by computing
the left and right eigenvectors of a normalized version
of A.



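Proof. Let

$$\bar{\mathbf{Z}} = \begin{bmatrix} \bar{\mathbf{X}} \\ \bar{\mathbf{Y}} \end{bmatrix} \quad \text{and} \quad \boldsymbol{\Phi} = \begin{bmatrix} \boldsymbol{\Phi}_1 & \mathbf{0} \\ \mathbf{0} & \boldsymbol{\Phi}_2 \end{bmatrix} \tag{13}$$

be rearranged into two smaller matrices that correspond
to $\mathbf{A}$ and $\mathbf{A}^T$ respectively. Then, eq. (12) can be rewritten
as:

$$\max J_b = \frac{1}{K} \operatorname{tr}\left( \begin{bmatrix} \bar{\mathbf{X}} \\ \bar{\mathbf{Y}} \end{bmatrix}^T \underbrace{\begin{bmatrix} \mathbf{0} & \bar{\mathbf{A}} \\ \bar{\mathbf{A}}^T & \mathbf{0} \end{bmatrix}}_{\bar{\mathbf{M}}} \begin{bmatrix} \bar{\mathbf{X}} \\ \bar{\mathbf{Y}} \end{bmatrix} \right), \tag{14}$$

where $\bar{\mathbf{A}} = \boldsymbol{\Phi}_1^{-1/2} \mathbf{A} \boldsymbol{\Phi}_2^{-1/2}$. Denoting $\hat{\mathbf{X}} \in \mathbb{C}^{M \times K}$ and
$\hat{\mathbf{Y}} \in \mathbb{C}^{N \times K}$ as the relaxed versions of $\bar{\mathbf{X}}$ and $\bar{\mathbf{Y}}$, by the
Ky Fan theorem, the global optimum solution to eq. (14)
is given by the first K eigenvectors of $\bar{\mathbf{M}}$:

$$\begin{bmatrix} \hat{\mathbf{X}} \\ \hat{\mathbf{Y}} \end{bmatrix} = \begin{bmatrix} \mathbf{x}_1, \ldots, \mathbf{x}_K \\ \mathbf{y}_1, \ldots, \mathbf{y}_K \end{bmatrix} \mathbf{Q}. \tag{15}$$

Therefore,

$$\begin{bmatrix} \mathbf{0} & \bar{\mathbf{A}} \\ \bar{\mathbf{A}}^T & \mathbf{0} \end{bmatrix} \begin{bmatrix} \mathbf{x}_k \\ \mathbf{y}_k \end{bmatrix} = \lambda_k \begin{bmatrix} \mathbf{x}_k \\ \mathbf{y}_k \end{bmatrix}, \tag{16}$$

where $k \in [1, K]$ and $\lambda_k$ denotes the $k$-th eigenvalue of $\bar{\mathbf{M}}$.
Then,

$$\bar{\mathbf{A}} \mathbf{y}_k = \lambda_k \mathbf{x}_k, \text{ and} \tag{17}$$
$$\bar{\mathbf{A}}^T \mathbf{x}_k = \lambda_k \mathbf{y}_k. \tag{18}$$

Thus, a relaxed global optimum solution to the problem
in eq. (12) can be found by computing the first K left and
right eigenvectors of $\bar{\mathbf{A}}$. □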
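
The correspondence between the eigenvectors of the block matrix and the left and right eigenvectors (singular vectors) of the normalized data matrix can be verified numerically. The sketch below is illustrative only: the random nonnegative matrix, the choice of row and column sums as the weights Φ_1 and Φ_2, and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
M_rows, N_cols, K = 5, 4, 2
A = rng.random((M_rows, N_cols))          # assumed nonnegative feature-by-item matrix

# Assumption: NAssoc-style weights, Phi_1 = row sums and Phi_2 = column sums.
phi1 = A.sum(axis=1)
phi2 = A.sum(axis=0)
A_bar = A / np.sqrt(phi1)[:, None] / np.sqrt(phi2)[None, :]

# Route 1: eigendecomposition of the (M+N) x (M+N) block matrix.
M_bar = np.block([[np.zeros((M_rows, M_rows)), A_bar],
                  [A_bar.T, np.zeros((N_cols, N_cols))]])
vals, vecs = np.linalg.eigh(M_bar)
vals, vecs = vals[::-1], vecs[:, ::-1]    # descending order

# Route 2: SVD of the much smaller A_bar.
U, s, Vh = np.linalg.svd(A_bar, full_matrices=False)

# Split the leading eigenvector into its row part x_1 and column part y_1.
x1, y1 = vecs[:M_rows, 0], vecs[M_rows:, 0]
print(vals[:K], s[:K])
```

The leading eigenvalues of the block matrix coincide with the leading singular values, and each block of an eigenvector is a scaled singular vector, so the smaller SVD suffices.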

Theorem 2 generalizes the work of Dhillon [15], where
the author gives a theoretical explanation only for 2-way
bipartite graph co-clustering. The multipartitioning
algorithm proposed by the author [15], which is derived
from the bipartitioning algorithm by induction, now has
a theoretical explanation.

The following theorem gives a support for an interest- 
ing claim in co-clustering: row clustering implies column 
clustering and vice versa. 



Theorem 3. Solving simultaneous row and column clus- 
tering is equivalent to solving row and column clustering 
separately, and consequently, row clustering implies col- 
umn clustering and vice versa. 

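Proof. By substituting $\mathbf{y}_k$ from eq. (18) into eq. (17), and
similarly, substituting $\mathbf{x}_k$ from eq. (17) into eq. (18), we get:

$$\bar{\mathbf{A}} \bar{\mathbf{A}}^T \mathbf{x}_k = \lambda_k^2 \mathbf{x}_k, \text{ and} \tag{19}$$
$$\bar{\mathbf{A}}^T \bar{\mathbf{A}} \mathbf{y}_k = \lambda_k^2 \mathbf{y}_k, \tag{20}$$

where $\bar{\mathbf{A}} \bar{\mathbf{A}}^T$ and $\bar{\mathbf{A}}^T \bar{\mathbf{A}}$ respectively denote the row and
column affinity matrices. After some manipulations, we
get:

$$\max_{\hat{\mathbf{X}}} \operatorname{tr}\left( \hat{\mathbf{X}}^T \bar{\mathbf{A}} \bar{\mathbf{A}}^T \hat{\mathbf{X}} \right) = \sum_{k=1}^{K} \lambda_k^2, \text{ and} \tag{21}$$
$$\max_{\hat{\mathbf{Y}}} \operatorname{tr}\left( \hat{\mathbf{Y}}^T \bar{\mathbf{A}}^T \bar{\mathbf{A}} \hat{\mathbf{Y}} \right) = \sum_{k=1}^{K} \lambda_k^2, \tag{22}$$

where $\hat{\mathbf{X}} = [\mathbf{x}_1, \ldots, \mathbf{x}_K]$ and $\hat{\mathbf{Y}} = [\mathbf{y}_1, \ldots, \mathbf{y}_K]$ respectively
denote the relaxed row and column clustering indicator
matrices. As shown above, $\hat{\mathbf{X}}$ and $\hat{\mathbf{Y}}$ can be
computed separately, and since $\hat{\mathbf{Y}}$ can be derived from
$\hat{\mathbf{X}}$ and vice versa (see eq. (17) and eq. (18)), row clustering
implies column clustering and vice versa. □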
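
A small numerical sketch of this argument follows, assuming a random nonnegative matrix stands in for the normalized data matrix. It solves the row problem alone through the row affinity matrix and then recovers the column vectors through eq. (18); the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
A_bar = rng.random((6, 4))        # assumed stand-in for the normalized data matrix
K = 2

U, s, Vh = np.linalg.svd(A_bar, full_matrices=False)

# Row clustering alone: top-K eigenpairs of the row affinity matrix A_bar A_bar^T.
row_vals, row_vecs = np.linalg.eigh(A_bar @ A_bar.T)
row_vals = row_vals[::-1][:K]
row_vecs = row_vecs[:, ::-1][:, :K]

# Column vectors implied by the row solution via eq. (18): y_k = A_bar^T x_k / lambda_k.
lam = np.sqrt(row_vals)
Y_from_X = (A_bar.T @ row_vecs) / lam
print(row_vals, s[:K] ** 2)
```

The eigenvalues of the row affinity matrix equal the squared singular values, and the derived columns match the right singular vectors up to sign.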

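Theorem 3 provides a "shortcut" for computing $\hat{\mathbf{X}}$ and
$\hat{\mathbf{Y}}$, which are usually constructed by computing the
first K eigenvectors of $\bar{\mathbf{A}} \bar{\mathbf{A}}^T$ and $\bar{\mathbf{A}}^T \bar{\mathbf{A}}$ respectively.

Theorem 4. $\hat{\mathbf{X}}$ and $\hat{\mathbf{Y}}$ can be constructed by computing
the first K left and right eigenvectors of $\bar{\mathbf{A}}$.

Proof. As shown in the proof of theorem 3, $\hat{\mathbf{X}} =
[\mathbf{x}_1, \ldots, \mathbf{x}_K]$ and $\hat{\mathbf{Y}} = [\mathbf{y}_1, \ldots, \mathbf{y}_K]$, where according to
the proof of theorem 2, $\mathbf{x}_k$ and $\mathbf{y}_k$ are the $k$-th left and
right eigenvectors of $\bar{\mathbf{A}}$. □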

4.2.2 Indirect treatment 

There are cases where the data points are inseparable
in the original space, or where clustering can be done more
effectively by first transforming $\mathbf{A}$ into a corresponding
symmetric matrix $\mathbf{V} \in \mathbb{R}_+^{N \times N}$ (we assume item clustering
for the rest of this subsection; feature clustering can
be done similarly). Then the graph cuts can be applied
to $\mathcal{G}(\mathbf{V})$ to obtain the item clustering.

There are two common approaches to learning $\mathbf{V}$ from
$\mathbf{A}$. The first approach is to use kernel functions. Table
2 lists the most widely used kernel functions according to
Dhillon et al. [14], where $\mathbf{a}_i$ is the $i$-th column of $\mathbf{A}$, and the
unknown parameters ($c$, $d$, $\alpha$, and $\theta$) are either determined
directly based on previous experience or learned from
sample datasets.



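Table 2: Examples of popular kernel functions [14].

Polynomial kernel    $\kappa(\mathbf{a}_i, \mathbf{a}_j) = (\mathbf{a}_i \cdot \mathbf{a}_j + c)^d$
Gaussian kernel      $\kappa(\mathbf{a}_i, \mathbf{a}_j) = \exp(-\|\mathbf{a}_i - \mathbf{a}_j\|^2 / 2\alpha^2)$
Sigmoid kernel       $\kappa(\mathbf{a}_i, \mathbf{a}_j) = \tanh(c(\mathbf{a}_i \cdot \mathbf{a}_j) + \theta)$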



The second approach is to make no assumption about
the data domain or the possible similarity structure
between item pairs. V is learned directly from the data,
thus avoiding some inherent problems associated with
the first approach, e.g., (1) there is no standard for choosing the
kernel function and (2) similarities between item pairs
are computed independently without considering
interactions among items. Some recent works on this
approach can be found in [11, 12, 13].

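Proposition 4. Clustering on $\mathcal{G}(\mathbf{V})$ can be stated as the
trace maximization of $\mathbf{V}$.

Proof. If the first approach is used, entries of $\mathbf{V}$ can
be determined using a kernel function:

$$V_{ij} = \begin{cases} \kappa(\mathbf{a}_i, \mathbf{a}_j) & \text{if } i \neq j \\ 0 & \text{if } i = j. \end{cases} \tag{23}$$

Similarly, if the second approach is used, $\mathbf{V}$ can be
learned directly from the data. Then, by using GWAssoc
as the objective, K-way clustering on $\mathcal{G}(\mathbf{V})$ can be
computed by:

$$\max J_b = \frac{1}{K} \operatorname{tr}\left( \bar{\mathbf{Z}}^T \boldsymbol{\Phi}^{-1/2} \mathbf{V} \boldsymbol{\Phi}^{-1/2} \bar{\mathbf{Z}} \right), \tag{24}$$

where $\bar{\mathbf{Z}}$ and $\boldsymbol{\Phi}$ are defined equivalently as in the
unipartite graph case. □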
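
The kernel-based construction of V can be sketched as follows, using the Gaussian kernel from table 2 and a zero diagonal as in eq. (23). The function name `gaussian_affinity`, the default value of `alpha`, and the toy data are assumptions for illustration.

```python
import numpy as np

def gaussian_affinity(A, alpha=1.0):
    """V_ij = exp(-||a_i - a_j||^2 / (2 alpha^2)) for i != j, and 0 on the diagonal.

    Columns of A are the items a_i, following the feature-by-item convention.
    """
    sq_norms = np.sum(A ** 2, axis=0)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (A.T @ A)
    sq_dists = np.maximum(sq_dists, 0.0)   # guard against tiny negative round-off
    V = np.exp(-sq_dists / (2.0 * alpha ** 2))
    np.fill_diagonal(V, 0.0)
    return V

# Toy data: four items (columns) forming two well-separated pairs.
A = np.array([[0.0, 0.1, 5.0, 5.1],
              [0.0, 0.0, 5.0, 5.0]])
V = gaussian_affinity(A, alpha=1.0)
print(np.round(V, 3))
```

The resulting V is symmetric and nonnegative, so it can be passed to the same trace maximization as a unipartite affinity matrix.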

If asymmetric metrics like Bregman divergences are
used as the kernel functions, the resulting $\mathbf{V}$ will be
asymmetric. Accordingly, $\mathcal{G}(\mathbf{V})$ is a directed graph, and
therefore it must be treated as a directed graph.

4.3 Directed graph clustering 

Research on directed graph clustering comes from
complex network studies conducted mainly by physicists.
Unlike the conventional method of ignoring
the edge directions, complex network researchers
preserve this information in their proposed methods. As
shown in [T] [H], accommodating it can be very useful in
improving clustering quality. In some cases, ignoring the
edge directions can lead to failure in detecting the clusters.
Directed graph clustering is usually done by mapping
the original square affinity matrix into another
square matrix whose entries are adjusted to emphasize
the importance of the edge directions. Some mapping
functions can be found in, e.g., [7, 16, 17]. To make use
of the available clustering methods for unipartite graphs,
some works [7, 17] construct a symmetric matrix
representation of the directed graph without ignoring the
edge directions.

Here we describe the directed graph clustering by nat- 
urally following the previous discussions on the unipar- 
tite and bipartite graph cases. 

Proposition 5. Directed graph clustering can be stated 
in the trace maximization problem of a symmetric ma- 
trix. 

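Proof. Let $\mathbf{B} \in \mathbb{R}_+^{N \times N}$ be the affinity matrix induced
from a directed graph, and $\boldsymbol{\Phi}_i$ and $\boldsymbol{\Phi}_o$ be diagonal weight
matrices associated with the indegree and outdegree of the
vertices in $\mathcal{G}(\mathbf{B})$ respectively. We define a diagonal weight
matrix of $\mathcal{G}(\mathbf{B})$ with:

$$\boldsymbol{\Phi}_{io} = \sqrt{\boldsymbol{\Phi}_i \boldsymbol{\Phi}_o}. \tag{25}$$

Since both rows and columns of $\mathbf{B}$ correspond to the
same set of vertices in the same order, the row and
column clustering indicator matrices are the same
matrix $\bar{\mathbf{Z}}$. By using GWAssoc, K-way clustering on $\mathcal{G}(\mathbf{B})$
and $\mathcal{G}(\mathbf{B}^T)$ can be found by:

$$J_{d1} = \frac{1}{K} \operatorname{tr}\left( \bar{\mathbf{Z}}^T \boldsymbol{\Phi}_{io}^{-1/2} \mathbf{B} \boldsymbol{\Phi}_{io}^{-1/2} \bar{\mathbf{Z}} \right), \text{ and} \tag{26}$$
$$J_{d2} = \frac{1}{K} \operatorname{tr}\left( \bar{\mathbf{Z}}^T \boldsymbol{\Phi}_{io}^{-1/2} \mathbf{B}^T \boldsymbol{\Phi}_{io}^{-1/2} \bar{\mathbf{Z}} \right) \tag{27}$$

respectively. By adding the two objectives above, we
obtain:

$$\max J_d = \frac{1}{K} \operatorname{tr}\left( \bar{\mathbf{Z}}^T \boldsymbol{\Phi}_{io}^{-1/2} \left( \mathbf{B} + \mathbf{B}^T \right) \boldsymbol{\Phi}_{io}^{-1/2} \bar{\mathbf{Z}} \right), \tag{28}$$

which is the trace maximization problem of a symmetric
matrix $\boldsymbol{\Phi}_{io}^{-1/2} \left( \mathbf{B} + \mathbf{B}^T \right) \boldsymbol{\Phi}_{io}^{-1/2}$. □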
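
The construction in eq. (28) can be sketched in NumPy as below. This is an illustration under stated assumptions: B_ij is taken to be the weight of the edge from vertex i to vertex j, indegree and outdegree are taken as column and row sums, every vertex is assumed to have positive indegree and outdegree, and the helper name `directed_symmetric_matrix` and the toy graph are hypothetical.

```python
import numpy as np

def directed_symmetric_matrix(B):
    """Return Phi_io^{-1/2} (B + B^T) Phi_io^{-1/2} with Phi_io = sqrt(Phi_i Phi_o)."""
    in_deg = B.sum(axis=0)       # column sums: total weight entering each vertex
    out_deg = B.sum(axis=1)      # row sums: total weight leaving each vertex
    phi_io = np.sqrt(in_deg * out_deg)
    scale = 1.0 / np.sqrt(phi_io)
    return scale[:, None] * (B + B.T) * scale[None, :]

# Toy directed graph: two directed 3-cycles joined by two weak edges.
B = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]:
    B[i, j] = 1.0
B[2, 3] = 0.1
B[5, 0] = 0.1

S = directed_symmetric_matrix(B)
vals, vecs = np.linalg.eigh(S)          # ascending order
second = vecs[:, -2]                    # eigenvector of the second largest eigenvalue
labels = (second > 0).astype(int)       # sign split for K = 2
print(labels)
```

Because S is symmetric, the same eigenvector routine used for the unipartite case applies unchanged.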



Directed graph clustering raises an interesting issue
in the weight matrix formulation which doesn't appear
in the unipartite and bipartite graph cases, as there the
edges are undirected. As explained in the original work
[13], $\boldsymbol{\Phi}$ is introduced with two purposes: first, to provide
a general form of the graph cuts objective from which other
objectives can be derived, and second, to provide
compatibility with the weighted kernel K-means objective
so that the eigenvector-free K-means algorithm can be
utilized to solve the graph cuts problem.

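However, once information on the edge directions
appears, defining a single weight for each vertex is no longer
adequate. To see the reason, let us apply NAssoc to $\mathcal{G}(\mathbf{B})$
and $\mathcal{G}(\mathbf{B}^T)$. By using table 1:

$$\max J_{d1} = \frac{1}{K} \operatorname{tr}\left( \bar{\mathbf{Z}}^T \mathbf{D}^{-1/2} \mathbf{B} \mathbf{D}^{-1/2} \bar{\mathbf{Z}} \right), \text{ and} \tag{29}$$
$$\max J_{d2} = \frac{1}{K} \operatorname{tr}\left( \bar{\mathbf{Z}}^T \mathbf{D}^{*-1/2} \mathbf{B}^T \mathbf{D}^{*-1/2} \bar{\mathbf{Z}} \right), \tag{30}$$

where $\mathbf{D}$ and $\mathbf{D}^*$ are diagonal weight matrices with
$D_{ii} = \sum_j B_{ij}$ and $D^*_{jj} = \sum_i B_{ij}$ respectively. But now
$J_{d1} + J_{d2}$ won't end up in a nice trace maximization of
a symmetric matrix as in eq. (28). Therefore, we cannot
apply the Ky Fan theorem to find a relaxed global
optimum solution.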

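This motivates us to define a more general form of the
weight matrix, $\boldsymbol{\Phi}_{io}$, which allows directed graph clustering
to be stated as the trace maximization of a symmetric
matrix, yet still turns into $\boldsymbol{\Phi}$ if the corresponding affinity
matrix is symmetric.

In the case of NAssoc and NCuts, $\boldsymbol{\Phi}_i$ and $\boldsymbol{\Phi}_o$ are
defined as:

$$\boldsymbol{\Phi}_i = \operatorname{diag}\left( \sum_i B_{i1}, \ldots, \sum_i B_{iN} \right), \text{ and} \tag{31}$$
$$\boldsymbol{\Phi}_o = \operatorname{diag}\left( \sum_j B_{1j}, \ldots, \sum_j B_{Nj} \right). \tag{32}$$

Note that there is no need to define a weight matrix for
RAssoc and RCuts since $\mathbf{I}$ is used.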

5 Extension to the Ky Fan Theorem 

Theorem 2 implies an extension of the Ky Fan theorem
to more general rectangular complex matrices.

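Theorem 5. The optimal value of the following problem:

$$\max_{\mathbf{X}^T \mathbf{X} = \mathbf{Y}^T \mathbf{Y} = \mathbf{I}_K} \operatorname{tr}\left( \mathbf{X}^T \mathbf{R} \mathbf{Y} \right) \tag{33}$$

is equal to $\sum_{k=1}^{K} \lambda_k$ if

$$\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_K]\,\mathbf{Q}, \text{ and} \tag{34}$$
$$\mathbf{Y} = [\mathbf{y}_1, \ldots, \mathbf{y}_K]\,\mathbf{Q}, \tag{35}$$

where $\mathbf{R} \in \mathbb{C}^{M \times N}$ denotes a full rank rectangular complex
matrix with eigenvalues $\lambda_1 \geq \ldots \geq \lambda_{\min(M,N)} \in \mathbb{R}_+$,
$0 < K \leq \min(M, N)$, $\mathbf{X} \in \mathbb{C}^{M \times K}$ and $\mathbf{Y} \in \mathbb{C}^{N \times K}$
denote unitary matrices, $\mathbf{x}_k$ and $\mathbf{y}_k$ ($k \in [1, K]$) respectively
denote the $k$-th left and right eigenvectors corresponding
to $\lambda_k$, and $\mathbf{Q} \in \mathbb{C}^{K \times K}$ denotes an arbitrary unitary
matrix.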



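Proof. Eq. (33) can be rewritten as:

$$\max_{\mathbf{X}^T \mathbf{X} = \mathbf{Y}^T \mathbf{Y} = \mathbf{I}_K} \frac{1}{2} \operatorname{tr}\left( \begin{bmatrix} \mathbf{X} \\ \mathbf{Y} \end{bmatrix}^T \underbrace{\begin{bmatrix} \mathbf{0} & \mathbf{R} \\ \mathbf{R}^T & \mathbf{0} \end{bmatrix}}_{\boldsymbol{\Psi}} \begin{bmatrix} \mathbf{X} \\ \mathbf{Y} \end{bmatrix} \right). \tag{36}$$

Since $\boldsymbol{\Psi}$ is a Hermitian matrix, by the Ky Fan theorem,
the global optimum solution is given by the first K
eigenvectors of $\boldsymbol{\Psi}$:

$$\begin{bmatrix} \mathbf{X} \\ \mathbf{Y} \end{bmatrix} = \begin{bmatrix} \mathbf{x}_1, \ldots, \mathbf{x}_K \\ \mathbf{y}_1, \ldots, \mathbf{y}_K \end{bmatrix} \mathbf{Q}. \tag{37}$$

By following the proof of theorem 2, it can be shown
that $\mathbf{x}_1, \ldots, \mathbf{x}_K$ and $\mathbf{y}_1, \ldots, \mathbf{y}_K$ are the first K left and
right eigenvectors of $\mathbf{R}$. □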

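Interestingly, theorem 5 can also be proven by using
the SVD definition.

Proof. Without loss of generality, assume $N \leq M$. Then

$$\mathbf{R} = \mathbf{U} \mathbf{S} \mathbf{V}^T = \mathbf{U}_{1,\ldots,K} \mathbf{S}_{1,\ldots,K} \mathbf{V}_{1,\ldots,K}^T + \mathbf{U}_{K+1,\ldots,N} \mathbf{S}_{K+1,\ldots,N} \mathbf{V}_{K+1,\ldots,N}^T, \tag{38}$$

where $\mathbf{U}$ and $\mathbf{V}$ are defined as in section 3, $\mathbf{U}_{a,\ldots,b}$
and $\mathbf{V}_{a,\ldots,b}$ denote matrices built by taking columns
$a$ to $b$ from $\mathbf{U}$ and $\mathbf{V}$ respectively, and $\mathbf{S}_{a,\ldots,b} =
\operatorname{diag}[\lambda_a, \ldots, \lambda_b]$. Then,

$$\mathbf{U}_{1,\ldots,K}^T \mathbf{R} \mathbf{V}_{1,\ldots,K} = \mathbf{S}_{1,\ldots,K}, \tag{39}$$

or more conveniently,

$$\mathbf{U}_K^T \mathbf{R} \mathbf{V}_K = \mathbf{S}_K. \tag{40}$$

Therefore,

$$\operatorname{tr}\left( \mathbf{U}_K^T \mathbf{R} \mathbf{V}_K \right) = \sum_{k=1}^{K} \lambda_k = \max_{\mathbf{X}^T \mathbf{X} = \mathbf{Y}^T \mathbf{Y} = \mathbf{I}_K} \operatorname{tr}\left( \mathbf{X}^T \mathbf{R} \mathbf{Y} \right). \tag{41}$$
□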



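Theorem 5 is the general form of theorem 2 and gives
theoretical support for directly applying the graph cuts
to the data matrix $\mathbf{A} \in \mathbb{R}_+^{M \times N}$ to get simultaneous row
and column clustering:

$$\max_{\mathbf{X}^T \mathbf{X} = \mathbf{Y}^T \mathbf{Y} = \mathbf{I}_K} \operatorname{tr}\left( \mathbf{X}^T \mathbf{A} \mathbf{Y} \right), \tag{42}$$

where $\mathbf{X} \in \mathbb{R}_+^{M \times K}$ and $\mathbf{Y} \in \mathbb{R}_+^{N \times K}$ denote the row and
column clustering indicator matrices respectively.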
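
Theorem 5 can be illustrated numerically with a complex rectangular matrix. The sketch below is a check under assumptions (a random complex test matrix, illustrative variable names): the first K left and right singular vectors attain the sum of the K largest singular values, while a random pair of orthonormal matrices stays below that value in magnitude.

```python
import numpy as np

rng = np.random.default_rng(4)
M, N, K = 7, 5, 3
R = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))  # assumed test matrix

U, s, Vh = np.linalg.svd(R, full_matrices=False)
X = U[:, :K]                   # first K left singular vectors
Y = Vh[:K, :].conj().T         # first K right singular vectors

# For complex matrices, "transpose" means conjugate transpose.
value = np.trace(X.conj().T @ R @ Y)

# Random orthonormal X and Y for comparison.
Xr, _ = np.linalg.qr(rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K)))
Yr, _ = np.linalg.qr(rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K)))
random_value = np.trace(Xr.conj().T @ R @ Yr)

print(value.real, s[:K].sum(), abs(random_value))
```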



6 Related works 

Zha et al. [T] and Ding et al. [5] mention the Ky Fan 
theorem in their discussions on the spectral clustering. 
However, the role of the theorem in the spectral cluster- 
ing can be easily overlooked as it is not clearly described. 

The equivalences between K-means clustering and
several graph cuts objectives and the trace maximization
objectives are well-known facts in spectral clustering
research, as many papers discuss them, with the
exception of the directed graph case, as this problem arises
from complex network research. Some representative
works are [H [1 [3 [H [H] . 

Leicht et al. [7] discuss how to extend the so-called
modularity, which is equivalent to the graph cuts
objective, from unipartite graphs to directed graphs. They
form an asymmetric modularity matrix $\mathbf{B}^* \in \mathbb{R}_+^{N \times N}$ by
applying a modularity function that emphasizes the
importance of the edge directions to the original asymmetric
affinity matrix $\mathbf{B} \in \mathbb{R}^{N \times N}$, and then transform $\mathbf{B}^*$ into
a symmetric matrix by adding $\mathbf{B}^*$ to its transpose. The
clustering is done by calculating the first K eigenvectors
of this symmetric matrix. This is equivalent to applying
RAssoc to $(\mathbf{B}^* + \mathbf{B}^{*T})$.

Kim et al. [17] propose a method for transforming the
affinity matrix induced from a directed graph into a
symmetric matrix without ignoring the edge directions, so that
clustering algorithms built for unipartite graphs can be
applied unchanged.

7 A note on spectral clustering algorithms

There are many spectral clustering algorithms available.
They differ in many aspects, from the chosen
affinity matrices to the postprocessing methods used to
derive a clustering from the eigenvectors. According to Luxburg
[6], the most popular ones are the algorithms by Shi et al. [3]
and by Ng et al. [4], with the former being more favorable
because the computed eigenvectors are more closely related to
the clustering indicator vectors.

Here we would like to note that according to Dhillon et
al. [13], a state-of-the-art spectral clustering algorithm
based on the work of Yu et al. [5] empirically performed
the best among the various spectral algorithms that were
tested in terms of optimizing the objective function
values. Furthermore, the multilevel algorithm proposed
in [14], which exploits the equivalences of various
graph clustering objectives to the weighted kernel K-means
objective to eliminate the need for eigenvector
computation, shows very promising results: while
moderately improving clustering quality, it drastically
improves computational speed (up to 2000 times faster
than the spectral method) and memory usage.

8 Conclusion 

We presented a concise explanation of the logic behind
spectral clustering. Unlike K-means clustering and
graph cuts, which are very intuitive and straightforward,
spectral clustering tends to be hard to comprehend. By
using the Ky Fan theorem, we showed that spectral
clustering has a simple and intuitive explanation.

We showed how to treat K-way clustering on unipartite,
bipartite, and directed graphs as trace maximization
problems on the corresponding symmetric matrices,
so that a unified treatment can be applied to those
graphs.

For bipartite graphs, we proved that the co-clustering
can be obtained by computing the left and right
eigenvectors of the corresponding feature-by-item data
matrix, thus generalizing the result of Dhillon [15] and
providing a theoretical basis for the spectral co-clustering
algorithms proposed in, e.g., [13 HI]. We also proved that
solving simultaneous row and column clustering is
equivalent to solving row and column clustering separately,
thus giving theoretical support for the claim: "column
clustering implies row clustering and vice versa", and
then gave a "shortcut" for computing the row and column
clustering indicator matrices.

In directed graph, we described a new clustering ob- 
jective by following the discussions on unipartite and 
bipartite graphs naturally. 

By extending theorem 2 to the complex domain, we
generalized the Ky Fan theorem to rectangular complex
matrices. The second proof of theorem 5 shows that this
theorem is a corollary of the SVD formulation, and thus the
Ky Fan theorem and its general form are corollaries
of the Eckart-Young theorem.

We must note, however, that as the mathematics
behind spectral clustering has a long history (the Ky Fan
theorem itself was proposed in the 1950s), it is probable that
the contributions in this paper are not new, or can be
derived easily from other well-established facts, theorems,
or definitions.